The data have been downloaded from the website and a basic cleaning has been done in the preparation step. Due to the size of the dataset, which is 2.5 Gb for 3 months of data, a sample is obtained in order to explore the data more efficiently.
The preparation of the data include:
Load the libraries.
library(dplyr)
library(reshape2)
library(magrittr)
library(latticeExtra)
library(zoo)
library(ggplot2)
library(raster)
library(sf)
library(RColorBrewer)
library(lubridate)
The sample data is read after the preparation step which includes a basic cleaning and preparation of the data.
data <- read.csv("data/taxi_sample_data.csv")
Besides the records of trips, the website includes a .csv with the taxi zones in New York
zones <- read.csv("data/taxi_zones.csv")
plot(spZones["zone"])
From the shp file, we have also information about boroughs:
borough <- ggplot() + geom_sf(data=spZones, aes(fill = borough))
borough
There is also information about the kind of service area in the .csv
a <- merge(spZones, zones[,c(1,4)], by='LocationID')
service <- ggplot() + geom_sf(data=a, aes(fill = service_zone))
service
summary(data)
## VendorID tpep_pickup_datetime tpep_dropoff_datetime
## Min. :1.000 2017-06-28 20:31:05: 3 2017-03-18 03:32:29: 3
## 1st Qu.:1.000 2017-11-27 20:08:58: 3 2017-06-05 19:30:40: 3
## Median :2.000 2017-03-01 18:04:24: 2 2017-03-01 18:11:50: 2
## Mean :1.549 2017-03-01 19:07:32: 2 2017-03-01 19:20:12: 2
## 3rd Qu.:2.000 2017-03-01 19:38:59: 2 2017-03-01 19:33:17: 2
## Max. :2.000 2017-03-01 21:44:09: 2 2017-03-01 20:13:47: 2
## (Other) :43751 (Other) :43751
## passenger_count trip_distance RatecodeID store_and_fwd_flag
## Min. :1.000 Min. : 0.510 Min. :1.000 N:43598
## 1st Qu.:1.000 1st Qu.: 1.200 1st Qu.:1.000 Y: 167
## Median :1.000 Median : 1.900 Median :1.000
## Mean :1.619 Mean : 2.942 Mean :1.002
## 3rd Qu.:2.000 3rd Qu.: 3.500 3rd Qu.:1.000
## Max. :6.000 Max. :52.200 Max. :4.000
##
## PULocationID DOLocationID payment_type fare_amount
## Min. : 4.0 Min. : 3.0 Min. :1 Min. : 2.50
## 1st Qu.:113.0 1st Qu.:100.0 1st Qu.:1 1st Qu.: 7.00
## Median :161.0 Median :161.0 Median :1 Median : 10.00
## Mean :161.5 Mean :158.4 Mean :1 Mean : 12.71
## 3rd Qu.:231.0 3rd Qu.:234.0 3rd Qu.:1 3rd Qu.: 15.00
## Max. :265.0 Max. :265.0 Max. :1 Max. :245.00
##
## extra mta_tax tip_amount tolls_amount
## Min. :0.5000 Min. :0.5 Min. : 0.010 Min. : 0.0000
## 1st Qu.:0.5000 1st Qu.:0.5 1st Qu.: 1.550 1st Qu.: 0.0000
## Median :0.5000 Median :0.5 Median : 2.150 Median : 0.0000
## Mean :0.6649 Mean :0.5 Mean : 2.765 Mean : 0.2123
## 3rd Qu.:1.0000 3rd Qu.:0.5 3rd Qu.: 3.200 3rd Qu.: 0.0000
## Max. :1.0000 Max. :0.5 Max. :425.470 Max. :37.7600
##
## improvement_surcharge total_amount
## Min. :0.3 Min. : 5.31
## 1st Qu.:0.3 1st Qu.: 10.30
## Median :0.3 Median : 13.56
## Mean :0.3 Mean : 17.16
## 3rd Qu.:0.3 3rd Qu.: 19.80
## Max. :0.3 Max. :445.27
##
Main characteristics seen in summary are related to the numeric variables, where maximum values are much higher than the mean or the median: total amount, tip amount, tolls, fare amount etc. This could be related to outliers or skewed distributions of the data.
We will now explore the numeric variables and look for the main characteristics and data distributions.
# extract from the dataset the numeric variables
df <- data[, c(5,11,14,15,17)]
df <- melt(df)
## No id variables; using all as measure variables
# Theme characteristics for plots.
myTheme <- custom.theme(pch=20, cex=0.5)
myTheme$strip.background$col <- "white"
# Frequency of data in each variable
histogram(~value|variable, data=df, par.settings=myTheme, xlab=" ", scales='free')
The boxplot can also help to see the data distribution. There are some data records which present very high values for the numeric variables. So the distribution does not seem to be normal.
ggplot(df, aes(x=variable,y=value)) +
geom_boxplot(aes(fill=value),outlier.size = 0.1) +
labs(x=" ", y=" ") +
facet_wrap(~ variable, scales='free')
We include new conditions for the numeric variables. The total, fare and tip amount have extremely large values.
We also check that some records have lower fares than tips, so that values are removed.
Aditionally, the tip should be at least 10% of the fare and we also consider only tips below 30% of the fare. The idea is to recommend a tip depending on other variables, but a tip above 30% cannot be recommended.
Remove records with fares lower than tips and tips lower than 10% and above 30%:
data <- data[data$fare_amount > data$tip_amount,]
data <- data[data$tip_amount > 0.1*data$fare_amount,]
data <- data[data$tip_amount < 0.3*data$fare_amount,]
histogram(data$tip_amount, par.settings=myTheme, xlab="tip amount")
We can have a look on the tip amount as a % of the fare amount.
histogram((data$tip_amount/data$fare_amount)*100, par.settings=myTheme, xlab="tip amount")
Most of the tips are between the 20 and 25% of the fare amount.
Another numeric variable that could be relevant is the duration of each trip. We can calculate this variable from the pick up and drop off time.
Time stamp are loaded as factor. Translate into daytime:
data$tpep_pickup_datetime <- as.POSIXct(as.character(data$tpep_pickup_datetime), tz = "GMT", format="%Y-%m-%d %H:%M:%S")
data$tpep_dropoff_datetime <- as.POSIXct(as.character(data$tpep_dropoff_datetime), tz = "GMT",format="%Y-%m-%d %H:%M:%S")
## transform to NY time zone
data$tpep_pickup_datetime <- with_tz(data$tpep_pickup_datetime, tzone = "America/New_York")
data$tpep_dropoff_datetime <- with_tz(data$tpep_dropoff_datetime, tzone = "America/New_York")
## new column with the differences:
## it takes few minutes
data$trip_timelength <- apply(data, 1, FUN=function(x) difftime(x[3], x[2]))
# Plot the new variable
histogram(data$trip_timelength, par.settings=myTheme, xlab="trip time duration", scales='free')
The data have negative values that should be removed
data <- data[data[,"trip_timelength" ] > 5,]
df <- data[, c(5,11,14,15,17,18)]
df <- melt(df)
histogram(data$trip_timelength, par.settings=myTheme, xlab="trip time duration", scales='free')
We can now plot all the numeric variables together again and we will see how the largest values in the fare amount have been removed at the same time that the negative trip duration. The boxplots show us the general statistics of the data, with the median value and the range of the quantiles; we can as well add the density functions in a violin plot, which helps with the shape of the data distributions when it is different from the normal.
# extract from the dataset the numeric variables
df <- data[, c(5,11,14,17,18)]
df$tipPer <- df[,3]/df[,2]*100
df <- melt(df)
## No id variables; using all as measure variables
ggplot(df, aes(x=variable,y=value)) +
geom_violin(trim=FALSE)+
geom_boxplot(width=0.2, outlier.shape=NA, aes(fill=value)) +
labs(x=" ", y=" ") +
facet_wrap(~ variable, scales='free')
We can have a look on the scatterplots of the numerical variables: a linear relationship is seen, but with some stratified characteristics. If we have a look on the tip amount at the center of the scatterplot matrix, we can see that there are some values of constant ‘tip amount’ independently of the fare amount.
pairs(data[,c(5,11,14,17,18)], cex=0.3)
Once the general linear relationship is seen, we can remove some of the outliers.
## First, calculate the quantiles and the IQR in the numeric variables:
quant <- function(x) {
q <- quantile(x, probs=c(.25 ,.75))
iqr <- IQR(x)
up <- q[2] + 1.5 *iqr
down <- q[1] - 1.5 *iqr
c <- c(up,down)
c
}
## apply this function to the 5 relevant numeric variables:
var <- list("trip_distance","fare_amount","tip_amount","tolls_amount","total_amount")
q <- lapply(var, FUN=function(x) quant(data[,x]))
names(q) <- var
## filter the dataset with these conditions:
subsetq <- function(x) {
for (i in 1:length(var)) {
out <- subset(x, x[var[[i]]] < (q[[i]][1]) & x[var[[i]]] > (q[[i]][2]))
}
out
}
data <- subsetq(data)
If we represent again the data:
# extract from the dataset the numeric variables
df <- data[, c(5,11,14,17,18)]
df$tipPer <- df[,3]/df[,2]*100
df <- melt(df)
## No id variables; using all as measure variables
ggplot(df, aes(x=variable,y=value)) +
geom_violin(trim=FALSE)+
geom_boxplot(width=0.2, outlier.shape=NA, aes(fill=value)) +
labs(x=" ", y=" ") +
facet_wrap(~ variable, scales='free')
Another variable that could be relevant is the time of pick up or drop off the taxi. Create new columns with the time (hour) of the pick up and drop off.
d1 <- as.character(data$tpep_pickup_datetime)
d2 <- as.character(data$tpep_dropoff_datetime)
data$t_pu <- hour(data$tpep_pickup_datetime)
data$t_do <- hour(data$tpep_dropoff_datetime)
Plot the frequency of records by hour of bick up.
df <- data[, c("tip_amount","t_pu", "t_do")]
barchart(table(data$t_pu), par.settings=myTheme, xlab=" ", horizontal=FALSE)
It can be seen that there is little amount of data between 2 to 9 a.m compared to the rest of the day. Less taxi trips are made on overnight hours. We see if there is any different on the tip depending on the pick up hour:
df <- df %>%
group_by(t_pu) %>%
summarise(m = median(tip_amount))%>%
as.data.frame()
dotplot(m~as.factor(t_pu), data=df, pch=20, grid=TRUE, cex=2, ylab='median tip', xlab='time', par.settings=myTheme)
Between 0 and 8 in the morning, the median tip amount behaves different, but we should be careful, due to the fact that it is the beriod with less records and the conclusions might not be representative.
The median value of tip (%) is representd for each hour. Again, it is difficult to find a conclusion due to the lack of data in the range of 0 to 9 hours, but highest value is found at 9 a.m and a trend for higher values between 11 to 12.
df <- data[ ,c("t_pu", "t_do")]
df$tipPer <- data[,14]/data[,11]*100
df <- df %>%
group_by(t_pu) %>%
summarise(m = median(tipPer))%>%
as.data.frame()
dotplot(m~as.factor(t_pu), data=df, pch=20, grid=TRUE, cex=2, ylab='median % tip', xlab='time', par.settings=myTheme)
After having seen the numeric variables of the dataset, we take care of the categorical variables that it includes.
which(data$payment_type != 1) # records of payments different than 1
## integer(0)
histogram(data$RatecodeID, par.settings=myTheme)
which(data$RatecodID != 1)
## integer(0)
From the zones .csv there is information about the zones LocationID, Borough and service zone. We create 2 columns in the dataframe with the PU (pick up) and DO (drop off) service and borough zone.
We can see that LocationID have 265 values, but zones only have information about 263 zones. In addition, the shape file has no information on this zones. Due to that, we decide to remove these records before continue:
data <- data[data$PULocationID <= 263,]
data <- data[data$DOLocationID <= 263,]
tail(zones)
## LocationID Borough Zone service_zone
## 260 260 Queens Woodside Boro Zone
## 261 261 Manhattan World Trade Center Yellow Zone
## 262 262 Manhattan Yorkville East Yellow Zone
## 263 263 Manhattan Yorkville West Yellow Zone
## 264 264 Unknown NV N/A
## 265 265 Unknown <NA> N/A
zones <- zones[1:263,]
Create 2 variables for the PU and DO service zone
data$PU_servicezone <- data$PULocationID
data$DO_servicezone <- data$DOLocationID
## replace the ID code with the zone in the zone dataset
repl <- function(x) {
for (i in 1:263) {
x[which(x["PU_servicezone"] == i), "PU_servicezone"] <- as.character(zones[i,"service_zone"])
x[which(x["DO_servicezone"] == i), "DO_servicezone"] <- as.character(zones[i,"service_zone"])
}
x }
data <- repl(data)
data$PU_servicezone <- factor(data$PU_servicezone, levels=c("Airports","Boro Zone", "EWR", "Yellow Zone"))
data$DO_servicezone <- factor(data$DO_servicezone, levels=c("Airports","Boro Zone", "EWR", "Yellow Zone"))
create 2 variables for the PU and DO borough
data$PU_boroughzone <- data$PULocationID
data$DO_boroughzone <- data$DOLocationID
## replace the ID code with the zone in the zone dataset
repl <- function(x) {
for (i in 1:263) {
x[which(x["PU_boroughzone"] == i), "PU_boroughzone"] <- as.character(zones[i,"Borough"])
x[which(x["DO_boroughzone"] == i), "DO_boroughzone"] <- as.character(zones[i,"Borough"])
}
x }
data <- repl(data)
data$PU_boroughzone <- factor(data$PU_boroughzone, levels=c("Bronx", "Brooklyn", "EWR", "Manhattan", "Queens", "Staten Island"))
data$DO_boroughzone <- factor(data$DO_boroughzone, levels=c("Bronx", "Brooklyn", "EWR", "Manhattan", "Queens", "Staten Island"))
We can now plot the frequency of each trip depending on its pick up and drop off area. The next figure shows how most of the trips are picked up in Manhattan, and they finish in the same area. Few trips that started in Manhattan finish in Brooklyn and Queens.
df <- data[,c(23,24)]
barchart(table(df), auto.key=TRUE, par.settings=myTheme,xlab="times", ylab='pick up')
There is also few amount of data records that show trips starting at Queens and Brooklyn, ending in a similar amount in Manhattan, Brooklyn or Queens. So that means that the trip flows are inside or to Manhattan mostly.
The same can be done by service zone, where we have information about the specific location of airports, what could be interesting.
df <- data[,c(21,22)]
barchart(table(df), auto.key=TRUE, par.settings=myTheme,xlab="times", ylab='pick up')
From that graph we see that half of the trips from airports end in the ‘Yellow Zone’ which is mostly the Manhattan area, as we could see in the figures at the beginning of the report.
We can see this variables on a map, taking advance of the higher resolution of the PU and DO location ID:
Number of times that the taxi is pick up by zone
PU <- data %>%
group_by(PULocationID) %>%
summarise(no_rows = length(PULocationID)) %>%
as.data.frame()
names(PU) <- c("LocationID","PU_times")
b <- merge(spZones, PU, by='LocationID')
## logaritmic scale because there is a lot of difference between airports (maximum) and the rest
pu <- ggplot() + geom_sf(data=b, aes(fill = PU_times))+scale_fill_continuous(trans = "log10", type='viridis')
pu
Number of times that the taxi is drop off up by zone
DO <- data %>%
group_by(DOLocationID) %>%
summarise(no_rows = length(DOLocationID))%>%
as.data.frame()
names(DO) <- c("LocationID","DO_times")
c <- merge(spZones, DO, by='LocationID')
do <- ggplot() + geom_sf(data=c, aes(fill = DO_times))+scale_fill_continuous(trans = "log10",type='viridis')
do
As it is the yellow taxi data, most of the trips are in the Manhattan area.
We can create a new categorical variable related to the time of the day in which the trip is made. From the ‘hour’ variable we can detect if it is ‘daytime’ or nigth.
data$t_puH <- data$t_pu
data$t_doH <- data$t_do
data[which(data[ ,"t_puH"] >= 7 & data[ ,"t_puH"] <= 20), 't_puH'] <- "day"
data[which(data[ ,"t_puH"] != "day"), 't_puH'] <- "night"
data[which(data[ ,"t_doH"] >= 7 & data[ ,"t_doH"] <= 20), 't_doH'] <- "day"
data[which(data[ ,"t_doH"] != "day"), 't_doH'] <- "night"
data$t_puH <- as.factor(data$t_puH)
data$t_doH <- as.factor(data$t_doH)
histogram(data$t_puH, par.settings=myTheme)
Most of the trips are made during the daytime hours. The selection of the hours is kind of arbitrary ,and a different classification of ‘day’ and ‘night’ could be also possible.
We consider now the day of the week as well as a categorical variable, in order to see if there is a day in which people take more the taxi or if it influences on the fares/tips etc.
data$weekday <- weekdays(data$tpep_pickup_datetime)
data$weekday <- factor(data$weekday, levels=c("domingo","lunes","martes", "miércoles","jueves","viernes","sábado"))
# plot
barchart(as.factor(data$weekday), par.settings=myTheme, horizontal=FALSE,xlab='day of the week')
Although most of the days have similar amount of records, it is clear that Sunday is the day when there are less taxi trips in the New York city.
We create another categorical variable related to the intra characteristic of the record. We consider trips that pick up and drop off in a different borough as intratrip. Next graph shows how most of the trip records are made in the same borough.
data$intratrip <- seq(1:nrow(data))
data[which(data$PU_boroughzone != data$DO_boroughzone), "intratrip"] <- "yes"
data[which(data$PU_boroughzone == data$DO_boroughzone), "intratrip"] <- "no"
histogram(as.factor(data$intratrip), par.settings=myTheme, xlab='intra-trip')
We can represent numeric variables depending on some categorical variables like: the pick up or drop off area or the day of the week.
## all together:
tip <- melt(data[,c(14,23,24)], 'tip_amount')
trip <- melt(data[,c(5,23,24)], 'trip_distance')
fare <- melt(data[,c(11,23,24)], 'fare_amount')
total <- melt(data[,c(17,23,24)], 'total_amount')
trip_time <- melt(data[,c(18,23,24)], 'trip_timelength')
all <- cbind(trip, tip, total, fare,trip_time)
all <- all[,c(1,4,7,10,13,14,15)]
b <- melt(all)
## Using variable, value as id variables
names(b) <- c("pu_do","borough","variable","value")
b$pu_do <- as.character(b$pu_do)
b[(b$pu_do == "PU_boroughzone"), "pu_do"] <- 'PU'
b[(b$pu_do == "DO_boroughzone"), "pu_do"] <- 'DO'
ggplot(b, aes(x=pu_do,y=value)) +
geom_boxplot(aes(fill=borough),outlier.size = 0.1) + scale_fill_brewer(palette="Paired") +
labs(x=" ", y=" ") +
facet_wrap(~ variable, scales='free', ncol=2) + theme(legend.position="top")
The above picture shows that the are few differences on the data distribution of each variable regardless of the pick up and drop off area. Among areas, we see how the borough of Manhattan is the one with lower fare, tip amount, trip distance and trip duration.
all <- data[,c(5,11,14,18,27)]
all <- melt(all, 'weekday')
ggplot(all, aes(x=weekday,y=value)) +
geom_violin(trim=FALSE)+
geom_boxplot(aes(fill=weekday),outlier.size = 0.1, outlier.shape = NA, width=0.3) + scale_fill_brewer(palette="Paired") +
labs(x=" ", y=" ") +
facet_wrap(~ variable, scales='free', ncol=2)+ theme(legend.position="top")
The figure shows that he sunday is different from the rest of the week, with higher median values from the trip distance. However, the rest of the variables it seems to be higher in the central days of the weeks. Differences are subtle in any case.
trip <- melt(data[,c(5,28)], 'trip_distance')
tip <- melt(data[,c(14,28)], 'tip_amount')
fare <- melt(data[,c(11,28)], 'fare_amount')
total <- melt(data[,c(17,28)], 'total_amount')
trip_time <- melt(data[,c(18,28)], 'trip_timelength')
all <- cbind(trip, tip, total, fare, trip_time)
all <- all[,c(1,4,7,10,13,14,15)]
b <- melt(all)[,2:4]
## Using variable, value as id variables
names(b) <- c("intratrip","variable","value")
ggplot(b, aes(x=intratrip,y=value)) +
geom_boxplot(aes(fill=intratrip),outlier.size = 0.1) + scale_fill_brewer(palette="Paired") +
labs(x=" ", y=" ") +
facet_wrap(~ variable, scales='free')
It seems to be a clear increase in all the numeric variables for when the taxi trip goes from a borough to another. However, if we have a closer look to the tip variable but in percentage:
## tip percentage
tip <- data[,c(11,14,28)]
tip$tipPer <- tip[,2]/tip[,1]*100
ggplot(tip, aes(x=intratrip,y=tipPer)) +
labs(x=" ", y="tip % ")+
geom_boxplot(aes(fill=intratrip),outlier.size = 0.1)+ scale_fill_brewer(palette="Paired")
The median tip % is lower for the intra-borough trips although as seen in previous figure the tip amount is largest.
We can also evaluate if the tip % changes with the time the taxi is picked up. It doesn’t seem to be important differences.
## tip percentage
tip <- data[,c(11,14,25)]
tip$tipPer <- tip[,2]/tip[,1]*100
ggplot(tip, aes(x=t_puH,y=tipPer)) +
labs(x=" ", y="tip % ")+
geom_boxplot(aes(fill=t_puH),outlier.size = 0.1)+ scale_fill_brewer(palette="Paired")
library(GGally)
ggpairs(data[,c(5,11,14,17,18)], aes(colour = data$t_puH, alpha = 0.4), upper = list(continuous = wrap("cor", size=2.5)),lower = list(continuous = wrap(funcVal="points", alpha = 1, size=0.5)))
Main differences are found in the time length of the trips, where the daytime records show data where the same distance takes longer in comparison with night-time recors. These 2 variables are the ones that correlates less, although night data correlates better.
ggpairs(data[, c(5,11,14,17,18)], aes(colour = data$PU_servicezone, alpha = 0.4), upper = list(continuous = wrap("cor", size=2.5)),lower = list(continuous = wrap(funcVal="points", alpha = 1, size=0.5)))
Also we observe how most of the data of airports have a different distribution compared to the other areas. In general, trips from the airport are more expensive an takes more time. Aditionally the tip amount is higher. If we check if this also happends with the tip %:
df <- data
df$tipPer <- data[,14]/data[,11]*100
ggpairs(df[, c(5,11,29,17,18)], aes(colour = df$PU_servicezone, alpha = 0.4), upper = list(continuous = wrap("cor", size=2.5)),lower = list(continuous = wrap(funcVal="points", alpha = 0.5, size=0.5)))
In this case, the % tip is centered in 20-25% when the taxi is picked up in the Yellow or Boro Zone. In the case of the airports, we see a bimodal distribution, with a second maximum beyond 25%. However, it is important to notice that the amount of data from boroughs different to Manhattan is scarce.
data$tipPer <-data[,14]/data[,11]*100
PU_tip<- data %>%
group_by(PULocationID) %>%
summarise(no_rows = mean(tipPer)) %>%
as.data.frame()
names(PU_tip) <- c("LocationID","PU_mean")
DO_tip<- data %>%
group_by(DOLocationID) %>%
summarise(no_rows = mean(tipPer)) %>%
as.data.frame()
names(DO_tip) <- c("LocationID","DO_mean")
d <- merge(spZones, PU_tip, by='LocationID')
pu_tip<- ggplot() + geom_sf(data=d, aes(fill = PU_mean))+scale_fill_continuous(type='viridis')
e <- merge(spZones, DO_tip, by='LocationID')
do_tip<- ggplot() + geom_sf(data=e, aes(fill = DO_mean))+scale_fill_continuous(type='viridis')
pu_tip
Most of the locations are around 20 to 23 but few of them present values below 15 and above 25.
We can present the same as previous figure but depending on the drop off location.
do_tip
In this cse there is one location with high values with respect to the others. In general there is homogenity.
PU_length <- data %>%
group_by(PULocationID) %>%
summarise(no_rows = mean(trip_distance))%>%
as.data.frame()
names(PU_length) <- c("LocationID","PU_length")
DO_length <- data %>%
group_by(DOLocationID) %>%
summarise(no_rows = mean(trip_distance))%>%
as.data.frame()
names(DO_length) <- c("LocationID","DO_length")
For the pick up location:
f <- merge(spZones, PU_length, by='LocationID')
pu_length <- ggplot() + geom_sf(data=f, aes(fill = PU_length))+scale_fill_continuous(type='viridis')
g <- merge(spZones, DO_length, by='LocationID')
do_length <- ggplot() + geom_sf(data=g, aes(fill = DO_length))+scale_fill_continuous(type='viridis')
pu_length
The trip length maximum is around 8 miles. Most of the trips are around 4. Inside Manhattan trip length is around 2 miles.
And the drop off:
do_length
The same map representation but depdending on the drop off shows different values. It seems to be a cluster centered in Manhattan with the lower values and largest values goes to the margins of the figure (in general). This has sense if we consider that most of the trips comes from the Manhattan area, so the distance increases for the locations far from that borough.
Some insights can be derived from the dataset of the taxi data:
About the cleaning process:
About the exploratory analysis:
The data is download from the website. As in the exploratory analysis, a sample of the data is used, due to the computational limitations.
After reading the data, it is cleaned taking into account the findings of the exploratory data analysis and some new variables are included:
In the cleaning phase:
The other variables included are: * Duration of the trip * The day of the week * The ‘intratrip’ variable
The prepared data is stored as a .csv called taxi_model_sample_data.csv. The script where this is done is called: model_sample.R
The dataset is then divided: 80% of the data is used as training data and the other 20% is used to test the model.
data <- read.csv("data/taxi_model_sample_data.csv")
#########################
## sample the data for modelling:
# Random sample indexes
train_index <- sample(1:nrow(data), 0.8 * nrow(data))
test_index <- setdiff(1:nrow(data), train_index)
# Build X_train, y_train, X_test, y_test
X_train <- data[train_index, -3]
y_train <- data[train_index, "tip_amount"]
X_test <- data[test_index, -3]
y_test <- data[test_index, "tip_amount"]
## save these datasets:
write.csv(X_train, "data/xtrain.csv", row.names=FALSE)
write.csv(y_train, "data/ytrain.csv",row.names=FALSE)
write.csv(X_test, "data/xtest.csv",row.names=FALSE)
write.csv(y_test, "data/ytest.csv",row.names=FALSE)
Code for the model run can be found in ‘model_run.R’, but I will reproduce it here.
## Built and run the model
## As seen in the EDA there is linear relationship among the numeric variables selected. Some of them are linear dependent on other, what should be seen in our model.
## Load the train data:
xtrain <- read.csv("data/xtrain.csv")
ytrain <- read.csv("data/ytrain.csv")
train <- cbind(ytrain, xtrain)
names(train) <- c("tip_amount",names(xtrain))
We first test a multivariate linear regression model. The numerical variables include will be the fare amount, the trip duration, the trip distance. Additionally we will include 2 categorical variables: the day of the week and if the trip is inside the same borough or not (the ‘intratrip’ variable).
Run the model
## 1.1 model: use all the variables except the total amount that depends on the dependent variable 'tip'.
train <- train[, -4]
lm <- lm(tip_amount~. ,data=train)
summary(lm)
##
## Call:
## lm(formula = tip_amount ~ ., data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.5601 -0.0986 0.1077 0.1908 5.0410
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.131570 0.011513 11.428 < 2e-16 ***
## trip_distance -0.008293 0.006157 -1.347 0.178022
## fare_amount 0.199459 0.003133 63.666 < 2e-16 ***
## trip_timelength 0.001510 0.001151 1.312 0.189359
## intratripyes 0.208733 0.007652 27.279 < 2e-16 ***
## weekdayjueves 0.028564 0.010169 2.809 0.004971 **
## weekdaylunes 0.041080 0.010647 3.858 0.000114 ***
## weekdaymartes 0.036796 0.010574 3.480 0.000502 ***
## weekdaymiércoles 0.032591 0.010249 3.180 0.001473 **
## weekdaysábado -0.009222 0.010800 -0.854 0.393161
## weekdayviernes 0.016534 0.010188 1.623 0.104617
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6531 on 105364 degrees of freedom
## Multiple R-squared: 0.8625, Adjusted R-squared: 0.8625
## F-statistic: 6.611e+04 on 10 and 105364 DF, p-value: < 2.2e-16
##confidence <- confint(lm)
lm$coefficients
## (Intercept) trip_distance fare_amount trip_timelength
## 0.131570301 -0.008292670 0.199458921 0.001510169
## intratripyes weekdayjueves weekdaylunes weekdaymartes
## 0.208733226 0.028564056 0.041079572 0.036795800
## weekdaymiércoles weekdaysábado weekdayviernes
## 0.032591283 -0.009221740 0.016533708
The training show a good performance of the linear regression. We can check for the importance of the different variables in the summary of the model.
Test the model
xtest <- read.csv("data/xtest.csv")
xtest <- xtest[,-3] ## remove the total amount
ytest <- read.csv("data/ytest.csv")
predictions <- predict(lm, xtest, interval="prediction")
results <- cbind(ytest, predictions)
The xyplot for the test and predicted:
plot(results$x~results$fit, xlab='predictions', ylab='test', cex=0.5)
abline(0,1, col='red')
We can check the predictions and we can also see that the errors are normally distributed.
## Normal distribution of residuals.
histogram(lm$residuals, par.settings=myTheme, xlab='residuals')
## RMSE
RMSE_lm <- sqrt(mean((results$x - results$fit)^2))
RMSE_lm
## [1] 0.6492917
However, the variance of the residuals is not constant, which means that the linear model may not be the best solution to this problem. It could be that we are missing some predictors or that some of the data collectd (those records with constant tip regardless the fare) cannot be used in the model.
plot(lm$residuals~lm$fitted.values, cex=0.5, xlab='fit values', ylab='residuals')
abline(c(0,0), col='red')
We now check a random forest in order to check if it improves the predictions of the multilinear model.
Run the model
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
## The following object is masked from 'package:dplyr':
##
## combine
# It takes long
rf_model <- randomForest(tip_amount ~., data = train, importance = TRUE, ntree=300)
rf_model
##
## Call:
## randomForest(formula = tip_amount ~ ., data = train, importance = TRUE, ntree = 300)
## Type of random forest: regression
## Number of trees: 300
## No. of variables tried at each split: 1
##
## Mean of squared residuals: 0.4868762
## % Var explained: 84.31
importance(rf_model)[,1] ## show the importance of the variables
## trip_distance fare_amount trip_timelength intratrip weekday
## 19.4983466 20.1764098 19.8973713 12.8809826 0.4999912
Test the model
rf_predict <- predict(rf_model, xtest)
results <- cbind(ytest, rf_predict)
plot(results$x~results$rf_predict, xlab='predictions', ylab='test', cex=0.5)
abline(0,1, col='red')
We can compute the error and RMSE as well.
RMSE_rf <- sqrt(mean((results$x-results$rf_predict)^2))
RMSE_rf
## [1] 0.695093
The relative error can be represented. Most of the time the error is small, but there are few predictions that are far from the real value.
err <- round((abs(results$x-results$rf_predict)/results$x)*100, digits=1)
histogram(err, par.settings=myTheme)
The random forest approach gives similar results to the multivariate regression model.
In order to go beyond this analysis:
We could enrich the analysis with data from every month of the year, in order to investigate seasonality in the data. Up to now only three months have been considered, so the complete annual cycle could be interesting to see if there is seasonality.
A global picture of the whole “taxi” sector is better done with all the datasets from other types of vehicles. In this analysis, only the ‘yellow’ taxi data is used, so it would be better to include the rest of the datasets.
We could also look for other variables that potentially could be interesting for the tip variable. Like knowing about the most dense traffic hours in each area, some weather data to look for patterns in the temporal series, etc.
As seen in the data, the tip amount have a clear linear pattern with different slopes. If it could be done a first classification to separate the lines with different slopes, the regression problem adapted to each case would work better.